
Motivation

Social media now plays many roles. People use platforms like YouTube not only to learn but also to listen to music and follow the news. This versatility makes YouTube a major force in shaping digital culture and trends, so understanding what drives a YouTuber’s success matters both to individual creators and to the broader media landscape. By analyzing data on top YouTubers across countries, this project aims to give content creators and advertisers insights they can use to improve engagement and profitability. The primary dataset, obtained from Kaggle, covers the 995 top-ranked YouTube channels worldwide. We use data visualization to profile these channels, examining their distribution across countries and categories and the relationships among success-related variables. We also build a predictive model to quantify and forecast YouTubers’ success.

Initial questions

  • 1. What does “success” mean for a YouTuber: subscribers, views per video, or earnings?
  • 2. Which factors influence the success of YouTubers?
  • 3. Can we develop a predictive model to forecast YouTubers’ success?

Data Cleaning and Preprocessing

We imported the data and applied janitor::clean_names to convert all variable names to lower case and replace the gaps with underscores. Because YouTube was created in 2005, we removed 6 accounts with a creation year before 2005, which are likely data entry errors. The ‘country’ and ‘category’ variables are encoded as factors to facilitate categorical analysis. We also added two new variables, ‘video_per_upload’ and ‘earning_differences’. The cleaned dataset comprises 588 observations across 20 variables, which are integral to our exploratory data analysis and model fitting:

  • id
  • subscribers: Number of subscribers to the channel
  • video_views: Total views across all videos on the channel
  • category: Category or niche of the channel
  • uploads: Total number of videos uploaded on the channel
  • country: Country where the YouTube channel originates
  • channel_type: Type of the YouTube channel (e.g., individual, brand)
  • video_views_rank: Ranking of the channel based on total video views
  • country_rank: Ranking of the channel based on the number of subscribers within its country
  • channel_type_rank: Ranking of the channel based on its type (individual or brand)
  • video_views_for_the_last_30_days: Total video views in the last 30 days
  • lowest_monthly_earnings: Lowest estimated monthly earnings from the channel
  • highest_monthly_earnings: Highest estimated monthly earnings from the channel
  • lowest_yearly_earnings: Lowest estimated yearly earnings from the channel
  • highest_yearly_earnings: Highest estimated yearly earnings from the channel
  • subscribers_for_last_30_days: Number of new subscribers gained in the last 30 days
  • created_year: Year when the YouTube channel was created
  • population: Total population of the country
  • latitude: Latitude coordinate of the country’s location
  • longitude: Longitude coordinate of the country’s location
  • video_per_upload: Average number of video views per video upload
  • earning_differences: Range of yearly earnings for each channel, calculated by subtracting lowest_yearly_earnings from highest_yearly_earnings.
# Make sure 'country' and 'category' are factors.
cleaned_df$country <- as.factor(cleaned_df$country)
cleaned_df$category <- as.factor(cleaned_df$category)

# Add new variables 'video_per_upload' and 'earning_differences'
cleaned_df$video_per_upload <- with(cleaned_df, video_views / uploads)
cleaned_df$earning_differences <- with(cleaned_df, highest_yearly_earnings - lowest_yearly_earnings)

In order to preserve data integrity and avoid a significant reduction in dataset size, we have opted to retain NA values within the dataset. Each instance of NA will be addressed individually in subsequent stages of our analysis, ensuring that they are appropriately managed during both the visualization and model fitting processes.
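
The cleaning steps described above can be sketched on a small made-up data frame. This is a minimal base-R sketch under stated assumptions, not the project's actual pipeline: the `toy` data and its values are invented, keeping rows whose creation year is missing is an assumption, and recoding a division by zero uploads to NA is an extra guard that does not appear in the original code.

```r
# Toy data standing in for the Kaggle file (all values are invented).
toy <- data.frame(
  created_year = c(1970, 2006, 2012, NA),
  video_views  = c(1e6, 5e9, 2e9, 3e8),
  uploads      = c(10, 500, 0, 40),
  lowest_yearly_earnings  = c(100, 2e5, 1e5, NA),
  highest_yearly_earnings = c(900, 3e6, 2e6, NA)
)

# Drop channels dated before YouTube's launch year.
# Assumption: rows with a missing year are kept rather than dropped.
cleaned_toy <- toy[is.na(toy$created_year) | toy$created_year >= 2005, ]

# Derived variables, computed the same way as in the report.
cleaned_toy$video_per_upload <- with(cleaned_toy, video_views / uploads)
cleaned_toy$earning_differences <- with(
  cleaned_toy, highest_yearly_earnings - lowest_yearly_earnings
)

# Zero uploads gives Inf; recode to NA so it is treated like any other
# missing value in later plots and models (assumption, not in the original).
cleaned_toy$video_per_upload[!is.finite(cleaned_toy$video_per_upload)] <- NA

print(cleaned_toy)
```

Retaining NA at this stage means each later step has to choose explicitly how to treat it, for example with `na.rm = TRUE` in summaries or `drop_na()` before model fitting.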

EDA

1. Univariate Analysis

How are the numerical variables distributed?

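# Packages used in the plots below
library(tidyverse)
library(plotly)

youtube_df <- cleaned_df

# Keep only the numeric variables of interest
all_columns <- colnames(youtube_df)
columns_to_plot <- all_columns[!all_columns %in% c("id", "category","country","abbreviation","channel_type","population","latitude","longitude","created_year")]

# Reshape to long format: one row per (variable, value) pair
numeric_data_long <- 
  youtube_df[, columns_to_plot] %>% 
  pivot_longer(everything(), names_to = "variable", values_to = "value")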


# Create a single plot with facets for each numeric variable
p <- ggplot(numeric_data_long, aes(x = value)) +
  geom_histogram(aes(y = after_stat(density)), bins = 15, fill = "#8dab7f", alpha = 0.8) +
  geom_density(color="#6b8e23")+
  facet_wrap(~ variable, scales = "free", ncol = 3) +
  scale_x_continuous(labels = scales::comma) +
  theme_minimal(base_size = 10) +  
  theme(
    strip.text.x = element_text(size = 10, face = "bold"), 
    axis.text.x = element_text(angle = 20, hjust = 1, vjust = 1,size=7,face = "bold"), # Angle x-axis labels for readability
    axis.title.x = element_text(size = 12),
    axis.title.y = element_text(size = 12),
    plot.title = element_text(size = 16, face = "bold"),
    plot.margin = margin(1, 1, 1, 1, "cm"), # Adjust the plot margins
    strip.background = element_blank(),
    panel.spacing = unit(3, "lines")
  ) +
  labs(
    title = "Distribution of Numeric Variables",
    x = "Value",
    y = "Density",
    caption = "Source: YouTube Data"
  )
# Convert to an interactive plot
ggplotly(p)

We use plotly to create interactive plots of the density distribution of each numerical variable. Because the distributions are strongly right-skewed, we apply a logarithmic transformation, log(value + 1), to these variables.

# Same plot on the log scale; log1p(value) equals log(value + 1)
p <- ggplot(numeric_data_long, aes(x = log1p(value))) +
  geom_histogram(aes(y = after_stat(density)), bins = 15, fill = "#8dab7f", alpha = 0.8) +
  geom_density(color="#6b8e23")+
  facet_wrap(~ variable, scales = "free", ncol = 3) +
  scale_x_continuous(labels = scales::comma) +
  theme_minimal(base_size = 10) +  
  theme(
    strip.text.x = element_text(size = 10, face = "bold"), 
    axis.text.x = element_text(angle = 20, hjust = 1, vjust = 1,size=7,face = "bold"), # Angle x-axis labels for readability
    axis.title.x = element_text(size = 12),
    axis.title.y = element_text(size = 12),
    plot.title = element_text(size = 16, face = "bold"),
    plot.margin = margin(1, 1, 1, 1, "cm"), # Adjust the plot margins
    strip.background = element_blank(),
    panel.spacing = unit(3, "lines")
  ) +
  labs(
    title = "Distribution of Log-Transformed Numeric Variables",
    x = "log(Value + 1)",
    y = "Density",
    caption = "Source: YouTube Data"
  )
# Convert to an interactive plot
ggplotly(p)
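
To illustrate why the log transformation helps, the sketch below compares the skewness of a right-skewed variable before and after applying log(x + 1). The data are simulated from a log-normal distribution rather than taken from the YouTube dataset, and `skewness()` is a small helper defined here only for illustration.

```r
set.seed(1)

# Simulated right-skewed counts (log-normal); not the YouTube data.
x <- rlnorm(1000, meanlog = 15, sdlog = 1.5)

# Sample skewness as the third standardized moment (hypothetical helper).
skewness <- function(v) {
  v <- v[!is.na(v)]
  mean((v - mean(v))^3) / mean((v - mean(v))^2)^1.5
}

# log1p(x) computes log(x + 1), so it stays defined when x is 0.
x_log <- log1p(x)

cat("skewness before:", round(skewness(x), 2), "\n")
cat("skewness after: ", round(skewness(x_log), 2), "\n")
```

A skewness near zero after the transformation indicates a roughly symmetric distribution, which is easier to read in a histogram and closer to what linear models assume.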

Channel created year summary

year_created_plot <- plot_ly(cleaned_df, x = ~created_year, type = "histogram", 
                marker = list(color = "#B3CDD1", line = list(color = "white", width = 1)),
                nbinsx = 30)

# Update layout
year_created_plot <- year_created_plot |>
  layout(
    title = "Distribution of Channel Creation Years",
    xaxis = list(title = "Year of Creation"),
    yaxis = list(title = "Number of Channels"),
    showlegend = FALSE,
    template = "plotly_white"  # Optional: Set a template for the plot
  )


year_created_plot

The histogram shows the number of channels created in each year. Channel creation grows from 2005 onwards, which is expected because YouTube was founded in February 2005 and gained popularity gradually.

Creation peaks in 2014, when 66 of the channels in the dataset were created, and stays relatively high from 2011 to 2016. This period may correspond to YouTube’s rise in global accessibility and to the platform becoming a viable career option for content creators.

After the mid-2010s, new channel creation among the top YouTubers declines noticeably. This could be due to market saturation or to creators diversifying onto emerging platforms; it may also reflect that newer channels have had less time to reach the top ranks.

Word Cloud

category_data <- youtube_df %>%
  filter(!is.na(category) & category != "nan") %>%
  count(category) %>%
  mutate(n=n*30) %>% 
  ungroup()


category_data$scaled_size <- log(category_data$n + 1) # adding 1 to avoid log(0)

wordcloud_plot <- ggplot(category_data, aes(label = category, size = scaled_size)) +
  geom_text_wordcloud(
    aes(color = n),
    shape = 'circle',
    rm_outside = TRUE
  ) +
  scale_size_area(max_size = 10) + 
  scale_color_gradient(low = "#ffcc99", high = "#8dab7f") +
  theme_void(base_family = "sans") +
  theme(legend.position = "none", 
        plot.margin = margin(1, 1, 1, 1, "cm")) # Adjust margins around the plot

# Display the plot
wordcloud_plot

We exclude channels whose category is recorded as "nan" and rescale the frequency n of each category (multiplying by 30 and taking the logarithm) to set the word sizes. According to the word cloud, the most frequent categories are Entertainment, Music, People & Blogs, and Gaming.

Pie Chart

# Count channels per channel type
channel_type_counts <- youtube_df %>%
  count(channel_type, name = "count")

color <- c("#ffcc99","#ffe4b5", "#ffd180","#ffa07a","#d1d17a", "#8dab7f", "#D2DFD9", "#A8C0B5", "#D1B9CB", "#B3CDD1", "#BBC1D0", "#E8C3C3","#C7CEBD", "#D2DFD9","#6b8e23")

# Create a pie chart using plotly with the custom colors
fig <- plot_ly(channel_type_counts, labels = ~channel_type, values = ~count, type = 'pie',
               textinfo = 'label+percent',
               insidetextorientation = 'radial',
               marker = list(colors = color))
fig %>% 
  layout(title = 'Pie Chart of Channel Types',
         showlegend = FALSE,
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))

This interactive pie chart shows the proportion of channels of each channel type. The most common channel types include Entertainment, Music, People & Blogs, Gaming, and Comedy.

2. Multivariate Analysis

Mean and median earning differences by YouTube content category

summary_data <- cleaned_df |>
  mutate(earning_diff = highest_yearly_earnings - lowest_yearly_earnings) |>
  filter(category != "nan") |>
  group_by(category) |>
  # na.rm = TRUE keeps a missing earning from turning a category's summary into NA
  summarize(mean_earning_diff = mean(earning_diff, na.rm = TRUE),
            median_earning_diff = median(earning_diff, na.rm = TRUE))

# Create a bar plot using Plotly
plot <- plot_ly(data = summary_data, x = ~category, type = 'bar',
                y = ~mean_earning_diff, name = 'Mean Earning Difference',marker = list(color = "#F4E1C1")) %>%
  add_trace(y = ~median_earning_diff, name = 'Median Earning Difference', marker = list(color = "#C7CEBD"))

# Add layout details
plot <- plot|>
  layout(title = 'Earning Difference Summary by Category',
               xaxis = list(title = 'Category'),
               yaxis = list(title = 'Earning Difference'),
         legend = list(orientation = 'h', x = 0.5, y = -0.3, xanchor = 'center', yanchor = 'top'))

# Show the plot (the layout was already applied above)
plot

For most categories, the mean earning difference is higher than the median, suggesting that a few channels with substantially higher earnings may be skewing the mean upwards.
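
The gap between mean and median can be reproduced with a handful of numbers. In this minimal sketch the earning ranges are invented for a single hypothetical category: five modest channels and one very large one.

```r
# Invented yearly earning ranges: most are modest, one is very large.
earning_diff <- c(2e5, 3e5, 3.5e5, 4e5, 5e5, 4e7)

mean_diff   <- mean(earning_diff)
median_diff <- median(earning_diff)

# The single extreme channel pulls the mean far above the median.
cat("mean:  ", mean_diff, "\n")
cat("median:", median_diff, "\n")

# With a missing value present, the summary is NA unless na.rm = TRUE is set.
with_na <- c(earning_diff, NA)
mean(with_na)                # NA
mean(with_na, na.rm = TRUE)  # equal to mean_diff
```

Because the median is insensitive to the one extreme value, it is usually the more representative summary for heavily right-skewed earnings.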

Top 15 YouTube Channels by Highest Yearly Earnings

color <- c("#ffe4b5", "#ffa07a","#d1d17a" , "#D2DFD9", "#A8C0B5", "#D1B9CB", "#B3CDD1", "#BBC1D0", "#E8C3C3")

earning_plot_data <-
  read_csv("Data/Global YouTube Statistics.csv",locale = locale(encoding = "Windows-1252"))  %>%
  janitor::clean_names() %>% 
  drop_na() %>% 
  select(youtuber, channel_type, highest_yearly_earnings) %>% 
  mutate(youtuber = stringi::stri_replace_all_regex(youtuber, "[^\x01-\x7F]", "")) %>% 
  arrange(desc(highest_yearly_earnings)) %>%
  top_n(15, highest_yearly_earnings)



plot_ly(earning_plot_data, x = ~highest_yearly_earnings, y = ~youtuber, 
                type = 'bar', orientation = 'h',
                color = ~channel_type, colors = color,
                text = ~paste('$', formatC(highest_yearly_earnings, format = "d", big.mark = ",")),
                textposition = 'inside',
                insidetextanchor = 'end', 
                textfont = list(color = 'white'), # text color
                hoverinfo = 'text',
                hovertemplate = paste('<b>Youtuber:</b> %{y}<br>',
                                      '<b>Earnings:</b> $%{x}<extra></extra>')) %>%
  layout(title = 'Top 15 YouTube Channels by Highest Yearly Earnings',
         xaxis = list(title = 'Yearly Earnings ($)'),
         yaxis = list(title = ''),
         showlegend = TRUE,
         margin = list(l = 100, r = 25, t = 50, b = 50),
         font = list(family = "Arial, sans-serif", size = 12, color = "#333333"))

Among the top 15 YouTube channels by highest yearly earnings, KIMPO ranks first with an estimated 163,400,400 US dollars in 2023, nearly three times the 59,800,000 dollars of dednahype, the lowest of the fifteen. Entertainment remains the predominant channel type among these channels, while Animals and Comedy each appear only once.

How are the top YouTubers distributed across countries?

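# Count channels per country, keeping one coordinate pair per country
channel_counts_by_location <- cleaned_df |>
  drop_na(c(latitude, longitude)) |>
  group_by(country, longitude, latitude) |>
  summarise(channel_count = n(), .groups = "drop")

world_map <- leaflet() |>
  addTiles() |>
  addMarkers(
    data = channel_counts_by_location,
    lng = ~longitude, lat = ~latitude,
    # Show the same text on hover (label) and on click (popup)
    label = ~paste0(country, ": ", channel_count, " channels"),
    popup = ~paste0(country, ": ", channel_count, " channels")
  )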

world_map

The five countries with the most top-ranked channels are the United States (N = 311), India (N = 168), Brazil (N = 61), the United Kingdom (N = 43), and Mexico (N = 33).

How are the numeric variables correlated?

# Calculate the correlation matrix
cor_matrix <- cor(youtube_df[, columns_to_plot], use = "complete.obs")

fig <- plot_ly(x = colnames(cor_matrix), y = rownames(cor_matrix), z = cor_matrix, 
               type = "heatmap",colorscale ="Blues"  , zmin = -1, zmax = 1)

fig <- fig %>% layout(
  yaxis = list(autorange = "reversed"),
  width=800,
  height=600,
  title = "Correlation Matrix")

fig

The heatmap shows Pearson correlation coefficients between the continuous variables, which range from -1 (perfect negative correlation) through 0 (no linear correlation) to 1 (perfect positive correlation). Subscribers and video views are strongly correlated (\(r\) = 0.85); their correlations with most other variables are moderate to weak (\(r\) around 0.46), and both are essentially uncorrelated with uploads (\(r\) = 0.08 and 0.15). The lowest and highest monthly and yearly earnings estimates are almost perfectly correlated with one another. video_per_upload shows moderate correlations with recent engagement metrics, while earning_differences is less correlated with viewership-related variables, hinting that factors beyond subscriber and view counts influence earnings.
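
The sketch below shows what `cor()` computes and how the `use` argument treats missing values. The data frame is invented for illustration; only the `use = "complete.obs"` setting is taken from the code above.

```r
# Invented values for three channel-level variables; one entry is missing.
df <- data.frame(
  subscribers = c(10, 20, 30, 40, 50),
  video_views = c(12, 18, 33, 41, 49),
  uploads     = c(5, NA, 2, 9, 4)
)

# Pearson r by definition: covariance over the product of standard deviations.
r_manual  <- with(df, cov(subscribers, video_views) /
                        (sd(subscribers) * sd(video_views)))
r_builtin <- cor(df$subscribers, df$video_views)

# "complete.obs" drops every row that has an NA in any column, so the
# subscribers/video_views entry is based on 4 rows here, not 5.
cor_complete <- cor(df, use = "complete.obs")

# "pairwise.complete.obs" drops rows only for the pair being computed.
cor_pairwise <- cor(df, use = "pairwise.complete.obs")

print(round(cor_complete, 3))
print(round(cor_pairwise, 3))
```

With `"complete.obs"`, a variable with many missing values shrinks the sample used for every pair, which is worth keeping in mind when reading the heatmap.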

Bubble Chart: Subscribers vs. Video Views by Category

size_factor <- 10^(-8.7) # Adjust this factor as needed to scale sizes up or down

youtube_df %>%
  filter(!is.na(category) & category != "nan") %>%
  plot_ly(
    x = ~subscribers, 
    y = ~video_views, 
    size = ~video_views_for_the_last_30_days * size_factor, 
    color = ~category,
    text = ~category,
    hoverinfo = 'text+x+y',
    type = 'scatter',
    mode = 'markers',
    marker = list(
      sizemode = 'area',
      sizeref = 2 * max(youtube_df$video_views_for_the_last_30_days * size_factor, na.rm = TRUE)/100
    )
  ) %>%
  layout(
    title = 'Subscribers vs. Video Views by Category',
    xaxis = list(type = 'log', title = 'Subscribers'),
    yaxis = list(type = 'log', title = 'Video Views'),
    hovermode = 'closest',
    showlegend = TRUE
  )

The bubble chart shows a positive relationship between the number of subscribers (x-axis) and total video views (y-axis): channels with more subscribers tend to have more total views. Bubble size represents video views in the last 30 days, and color represents the category. Categories such as Music, Entertainment, and Gaming appear more frequently among the channels with the most views and subscribers, reflecting their popularity on YouTube.

Relationship between Views per Upload and Highest Yearly Earning

perUploadData <- cleaned_df |>
  mutate(viewsPerUpload = video_views / uploads) |>
  filter(viewsPerUpload < 600000000)

scatter_plot_perUpLoad <- ggplot(data = perUploadData, aes(x = viewsPerUpload, y = highest_yearly_earnings)) +
  geom_point(alpha=0.7,color="#8dab7f") +
  geom_smooth(method = "lm", color = "#ffa07a") + 
  labs(title = 'Scatter Plot of Highest Yearly Earnings vs. Views per Upload for Content Creators',
       x = 'Views per Upload',
       y = 'Highest Yearly Earnings')+
  theme_minimal()

# Print the plot
print(scatter_plot_perUpLoad)

The scatterplot shows a discernible but weak downward slope between highest yearly earnings and views per upload. In general, channels with more views per upload tend to have slightly lower earnings. Because the relationship is weak, factors other than views per upload likely influence the earnings of YouTube channels.

Model

1. Fit the multiple linear regression model of subscribers

Model statement: \(\mathrm{subscribers} = \beta_0 + \beta_1\,\mathrm{country} + \beta_2\,\mathrm{category} + \beta_3\,\mathrm{video\_per\_upload} + \beta_4\,\mathrm{uploads} + \beta_5\,\mathrm{video\_views}\)

youtube_df <- youtube_df %>% drop_na()
# Fit the multiple linear regression model
mlr_model <- lm(subscribers ~ country + category + video_per_upload +uploads +video_views , data = youtube_df)
  
youtube_df %>% 
  modelr::add_predictions(mlr_model) %>% 
  ggplot(aes(x = subscribers, y = pred)) +
  geom_point() +
  labs(
        title = "Multivariate Linear Model",
        x = "observed subscribers",
        y = "predicted subscribers") +
  theme_pubclean()

check_model(mlr_model, check = c("linearity", "outliers", "qq", "normality"))

# Summary of the model
mlr_model%>% 
broom::tidy()%>% 
knitr::kable(digits=3) 
term estimate std.error statistic p.value
(Intercept) 9376280.100 7999090.827 1.172 0.242
countryAustralia -5472212.306 8078440.308 -0.677 0.498
countryBarbados 8869689.318 10470708.869 0.847 0.397
countryBrazil 1063831.738 3513446.361 0.303 0.762
countryCanada 739423.789 5202276.951 0.142 0.887
countryChile 9890517.481 6702783.611 1.476 0.141
countryChina 4306410.207 10860207.744 0.397 0.692
countryColombia 628666.001 4555343.991 0.138 0.890
countryCuba -15208894.389 39353393.646 -0.386 0.699
countryEcuador 482942.941 7727581.887 0.062 0.950
countryEgypt 29112.446 7869216.335 0.004 0.997
countryEl Salvador 26539013.988 10547644.419 2.516 0.012
countryFrance 855.337 5886668.193 0.000 1.000
countryGermany -3845926.857 5872522.560 -0.655 0.513
countryIndia 2230051.232 3157552.062 0.706 0.480
countryIndonesia 3667971.184 3822919.512 0.959 0.338
countryItaly -705999.140 7722280.705 -0.091 0.927
countryJapan -4897660.367 5853688.097 -0.837 0.403
countryJordan -5716738.222 6535116.343 -0.875 0.382
countryKuwait 16638412.903 10548675.332 1.577 0.115
countryLatvia -10711499.813 10504680.653 -1.020 0.308
countryMalaysia -1391640.694 10592297.201 -0.131 0.896
countryMexico 2603524.238 3914550.883 0.665 0.506
countryNetherlands 1117730.024 7732549.460 0.145 0.885
countryPakistan -1808473.884 5213982.019 -0.347 0.729
countryPhilippines -282333.291 4837637.524 -0.058 0.953
countryRussia -46260.280 4147480.887 -0.011 0.991
countrySamoa -5072846.712 10512137.650 -0.483 0.630
countrySaudi Arabia 46119.008 5158210.195 0.009 0.993
countrySingapore -6983564.461 7751747.825 -0.901 0.368
countrySouth Korea 11422156.311 4532264.675 2.520 0.012
countrySpain -701473.589 4377564.892 -0.160 0.873
countrySweden 840319.852 7781547.211 0.108 0.914
countrySwitzerland -1172193.393 11106777.278 -0.106 0.916
countryThailand -3951173.515 4195676.870 -0.942 0.347
countryTurkey -12238300.878 6562702.799 -1.865 0.063
countryUkraine -1572524.656 5485703.370 -0.287 0.774
countryUnited Arab Emirates 1283973.834 5019181.377 0.256 0.798
countryUnited Kingdom 267042.876 3617711.558 0.074 0.941
countryUnited States 1231887.262 3131795.390 0.393 0.694
countryVenezuela 10814024.934 10522436.612 1.028 0.305
countryVietnam -3329160.717 7703918.031 -0.432 0.666
categoryComedy 1089754.640 7534028.409 0.145 0.885
categoryEducation 831486.212 7619585.450 0.109 0.913
categoryEntertainment 1300231.209 7450481.786 0.175 0.862
categoryFilm & Animation 884298.396 7619659.355 0.116 0.908
categoryGaming 66210.665 7541954.178 0.009 0.993
categoryHowto & Style 974932.248 7962424.542 0.122 0.903
categoryMovies 6606348.817 10262823.888 0.644 0.520
categoryMusic 1582993.394 7462956.180 0.212 0.832
categorynan 520376.447 7553848.246 0.069 0.945
categoryNews & Politics 3920532.241 7970252.886 0.492 0.623
categoryNonprofits & Activism 14737864.453 10298596.694 1.431 0.153
categoryPeople & Blogs 1346729.750 7472123.599 0.180 0.857
categoryPets & Animals -2129591.804 9412383.766 -0.226 0.821
categoryScience & Technology 4641578.970 7991011.762 0.581 0.562
categoryShows -2025300.247 8008875.468 -0.253 0.800
categorySports 6076125.766 8283199.028 0.734 0.464
categoryTrailers 11292393.086 10264244.021 1.100 0.272
video_per_upload 0.001 0.002 0.782 0.434
uploads -27.510 13.321 -2.065 0.039
video_views 0.001 0.000 36.598 0.000
mlr_model %>% 
broom::glance() %>%
knitr::kable(digits=3)
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.754 0.725 9982695 26.429 0 61 -10278 20682 20957.73 5.241811e+16 526 588
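
The long list of `country...` and `category...` rows arises because `lm()` expands each factor into indicator (dummy) variables, one per level except the reference level, which is the level absent from the table. The sketch below shows this on invented data; the `toy` values and the three country levels are illustrative only.

```r
# Invented data: one numeric predictor and a three-level factor.
toy <- data.frame(
  subscribers = c(12, 15, 30, 33, 21, 24),
  video_views = c(100, 130, 300, 320, 200, 230),
  country     = factor(c("Brazil", "Brazil", "India", "India", "Mexico", "Mexico"))
)

# model.matrix() shows the design matrix that lm() builds: an intercept,
# one 0/1 column per non-reference country level, and video_views.
X <- model.matrix(subscribers ~ country + video_views, data = toy)
print(colnames(X))

fit <- lm(subscribers ~ country + video_views, data = toy)

# Each country coefficient is the expected difference from the reference
# level (the first factor level, here "Brazil") at equal video_views.
print(coef(fit))
```

Read this way, a row such as countrySouth Korea in the table is an estimated offset relative to the reference country, holding the other predictors fixed.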

2. Fit the multiple linear regression model of earning_differences

Model statement: \(\mathrm{earning\_differences} = \beta_0 + \beta_1\,\mathrm{country} + \beta_2\,\mathrm{category} + \beta_3\,\mathrm{video\_per\_upload} + \beta_4\,\mathrm{uploads} + \beta_5\,\mathrm{video\_views}\)

mlr_model_1 <- lm(earning_differences ~ country + category + video_per_upload +uploads +video_views , data = youtube_df)
  
  youtube_df %>% 
  modelr::add_predictions(mlr_model_1) %>% 
  ggplot(aes(x = earning_differences, y = pred)) +
  geom_point() +
  labs(
        title = "Multivariate Linear Model",
        x = "earning_differences",
        y = "predictions") +
  theme_pubclean()

check_model(mlr_model_1, check = c("linearity", "outliers", "qq", "normality"))

# Summary of the model
mlr_model_1%>% 
broom::tidy() %>%
knitr::kable(digits=3)
term estimate std.error statistic p.value
(Intercept) 7505843.615 9623358.716 0.780 0.436
countryAustralia 3286609.208 9718820.630 0.338 0.735
countryBarbados -7937853.449 12596855.022 -0.630 0.529
countryBrazil -651672.832 4226874.703 -0.154 0.878
countryCanada -2344415.371 6258633.427 -0.375 0.708
countryChile -5990012.829 8063827.812 -0.743 0.458
countryChina -3523664.107 13065444.200 -0.270 0.788
countryColombia 1621029.024 5480336.485 0.296 0.768
countryCuba 32634726.214 47344358.496 0.689 0.491
countryEcuador -4753540.336 9296718.104 -0.511 0.609
countryEgypt -4685984.291 9467112.356 -0.495 0.621
countryEl Salvador -7529566.868 12689412.841 -0.593 0.553
countryFrance -5545773.340 7081994.802 -0.783 0.434
countryGermany 5023180.904 7064976.807 0.711 0.477
countryIndia -1103754.350 3798713.731 -0.291 0.772
countryIndonesia -3702317.503 4599188.407 -0.805 0.421
countryItaly 17725595.980 9290340.483 1.908 0.057
countryJapan 6095892.544 7042317.882 0.866 0.387
countryJordan -7667463.169 7862114.605 -0.975 0.330
countryKuwait 973103.211 12690653.087 0.077 0.939
countryLatvia 37811642.749 12637724.999 2.992 0.003
countryMalaysia -4297556.430 12743132.664 -0.337 0.736
countryMexico -3513157.824 4709426.130 -0.746 0.456
countryNetherlands -5223709.855 9302694.376 -0.562 0.575
countryPakistan 7363090.765 6272715.286 1.174 0.241
countryPhilippines -7033588.711 5819951.572 -1.209 0.227
countryRussia -3850992.022 4989654.101 -0.772 0.441
countrySamoa -4268565.526 12646696.189 -0.338 0.736
countrySaudi Arabia -7073939.302 6205618.627 -1.140 0.255
countrySingapore -13127214.056 9325791.095 -1.408 0.160
countrySouth Korea 18564877.388 5452570.762 3.405 0.001
countrySpain -3992247.305 5266458.173 -0.758 0.449
countrySweden 1087634.776 9361641.442 0.116 0.908
countrySwitzerland -5560344.074 13362081.296 -0.416 0.677
countryThailand -9277039.168 5047636.594 -1.838 0.067
countryTurkey 9189699.113 7895302.673 1.164 0.245
countryUkraine -5152068.187 6599611.441 -0.781 0.435
countryUnited Arab Emirates 2677976.449 6038359.096 0.443 0.658
countryUnited Kingdom -4138003.116 4352311.632 -0.951 0.342
countryUnited States -2187652.475 3767726.998 -0.581 0.562
countryVenezuela -9302188.332 12659086.424 -0.735 0.463
countryVietnam -1064136.589 9268249.147 -0.115 0.909
categoryComedy 290642.811 9063862.322 0.032 0.974
categoryEducation -5143240.331 9166792.282 -0.561 0.575
categoryEntertainment -2003594.778 8963351.009 -0.224 0.823
categoryFilm & Animation -2426443.611 9166881.194 -0.265 0.791
categoryGaming -3616205.562 9073397.471 -0.399 0.690
categoryHowto & Style -4393500.858 9579247.101 -0.459 0.647
categoryMovies -5226411.984 12346757.631 -0.423 0.672
categoryMusic -5400621.796 8978358.410 -0.602 0.548
categorynan 4461079.060 9087706.707 0.491 0.624
categoryNews & Politics -4810603.965 9588665.042 -0.502 0.616
categoryNonprofits & Activism -7024002.689 12389794.340 -0.567 0.571
categoryPeople & Blogs -2733019.699 8989387.334 -0.304 0.761
categoryPets & Animals 1338221.067 11323630.062 0.118 0.906
categoryScience & Technology -5132121.348 9613639.143 -0.534 0.594
categoryShows 805997.958 9635130.193 0.084 0.933
categorySports -2361036.909 9965156.952 -0.237 0.813
categoryTrailers -10812255.538 12348466.131 -0.876 0.382
video_per_upload -0.002 0.002 -1.054 0.292
uploads 25.544 16.026 1.594 0.112
video_views 0.001 0.000 15.541 0.000
mlr_model_1 %>% 
broom::glance() %>%
knitr::kable(digits=3)
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.438 0.373 12009746 6.729 0 61 -10386.7 20899.4 21175.13 7.586709e+16 526 588
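
As a consistency check on the table above, the adjusted \(R^2\) can be recomputed from the reported \(R^2\), the number of observations \(n = 588\) (nobs), and the number of estimated slope terms \(p = 61\) (df). This uses the standard adjusted \(R^2\) formula; the inputs are the rounded values from the table, so the result is approximate.

```latex
\[
R^2_{\mathrm{adj}}
  = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
  = 1 - (1 - 0.438)\,\frac{587}{526}
  \approx 1 - 0.562 \times 1.116
  \approx 0.373
\]
```

This agrees with the reported adj.r.squared of 0.373. The gap between 0.438 and 0.373 reflects the penalty for the 61 terms, most of which are country and category indicators.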

3. Fit the multiple linear regression model of highest_yearly_earnings

Model statement: \(\mathrm{highest\_yearly\_earnings} = \beta_0 + \beta_1\,\mathrm{country} + \beta_2\,\mathrm{category} + \beta_3\,\mathrm{video\_per\_upload} + \beta_4\,\mathrm{uploads} + \beta_5\,\mathrm{video\_views}\)

mlr_model_2 <- lm(highest_yearly_earnings ~ country + category + video_per_upload+uploads +video_views , data = youtube_df)

  youtube_df %>% 
  modelr::add_predictions(mlr_model_2) %>% 
  ggplot(aes(x = highest_yearly_earnings, y = pred)) +
  geom_point() +
  labs(
        title = "Multivariate Linear Model",
        x = "highest_yearly_earnings",
        y = "predictions") +
  theme_pubclean()

check_model(mlr_model_2, check = c("linearity", "outliers", "qq", "normality"))

# Summary of the model
mlr_model_2%>% 
broom::tidy() %>%
knitr::kable(digits=3)
term estimate std.error statistic p.value
(Intercept) 8014304.518 10263302.263 0.781 0.435
countryAustralia 3527574.695 10365112.296 0.340 0.734
countryBarbados -8457592.238 13434533.042 -0.630 0.529
countryBrazil -684530.551 4507957.563 -0.152 0.879
countryCanada -2491751.885 6674826.171 -0.373 0.709
countryChile -6380571.699 8600064.143 -0.742 0.458
countryChina -3752537.270 13934282.922 -0.269 0.788
countryColombia 1742683.485 5844773.275 0.298 0.766
countryCuba 34821010.475 50492710.081 0.690 0.491
countryEcuador -5060952.920 9914940.382 -0.510 0.610
countryEgypt -4991270.448 10096665.678 -0.494 0.621
countryEl Salvador -8021060.834 13533245.861 -0.593 0.554
countryFrance -5907274.618 7552940.238 -0.782 0.434
countryGermany 5389320.770 7534790.563 0.715 0.475
countryIndia -1169257.163 4051324.322 -0.289 0.773
countryIndonesia -3942928.257 4905029.749 -0.804 0.422
countryItaly 18932374.896 9908138.656 1.911 0.057
countryJapan 6513549.121 7510624.844 0.867 0.386
countryJordan -8168177.752 8384937.213 -0.974 0.330
countryKuwait 1045429.364 13534568.583 0.077 0.938
countryLatvia 40302216.073 13478120.831 2.990 0.003
countryMalaysia -4575492.584 13590537.998 -0.337 0.737
countryMexico -3738452.955 5022598.168 -0.744 0.457
countryNetherlands -5561365.107 9921314.070 -0.561 0.575
countryPakistan 7858054.990 6689844.459 1.175 0.241
countryPhilippines -7494453.788 6206972.420 -1.207 0.228
countryRussia -4099482.942 5321461.013 -0.770 0.441
countrySamoa -4543730.517 13487688.596 -0.337 0.736
countrySaudi Arabia -7538239.422 6618285.941 -1.139 0.255
countrySingapore -13992604.701 9945946.697 -1.407 0.160
countrySouth Korea 19806130.071 5815161.160 3.406 0.001
countrySpain -4250365.290 5616672.273 -0.757 0.450
countrySweden 1148466.395 9984181.056 0.115 0.908
countrySwitzerland -5922009.800 14250646.083 -0.416 0.678
countryThailand -9887305.924 5383299.283 -1.837 0.067
countryTurkey 9825286.892 8420332.253 1.167 0.244
countryUkraine -5490115.784 7038478.876 -0.780 0.436
countryUnited Arab Emirates 2867917.580 6439903.821 0.445 0.656
countryUnited Kingdom -4404027.123 4641735.920 -0.949 0.343
countryUnited States -2325658.305 4018277.003 -0.579 0.563
countryVenezuela -9915108.764 13500902.768 -0.734 0.463
countryVietnam -1125864.331 9884578.269 -0.114 0.909
categoryComedy 292233.597 9666599.929 0.030 0.976
categoryEducation -5503754.935 9776374.626 -0.563 0.574
categoryEntertainment -2155817.991 9559404.715 -0.226 0.822
categoryFilm & Animation -2608082.406 9776469.450 -0.267 0.790
categoryGaming -3874183.218 9676769.155 -0.400 0.689
categoryHowto & Style -4702253.317 10216257.270 -0.460 0.646
categoryMovies -5590571.371 13167804.430 -0.425 0.671
categoryMusic -5778060.854 9575410.093 -0.603 0.546
categorynan 4742611.242 9692029.940 0.489 0.625
categoryNews & Politics -5147695.864 10226301.494 -0.503 0.615
categoryNonprofits & Activism -7512888.934 13213703.036 -0.569 0.570
categoryPeople & Blogs -2930450.073 9587172.430 -0.306 0.760
categoryPets & Animals 1424822.996 12076639.919 0.118 0.906
categoryScience & Technology -5489495.414 10252936.348 -0.535 0.593
categoryShows 845790.093 10275856.530 0.082 0.934
categorySports -2534034.619 10627829.732 -0.238 0.812
categoryTrailers -11549784.310 13169626.543 -0.877 0.381
video_per_upload -0.002 0.002 -1.055 0.292
uploads 27.269 17.091 1.595 0.111
video_views 0.001 0.000 15.543 0.000
mlr_model_2 %>% 
broom::glance() %>%
knitr::kable(digits=3)
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.438 0.373 12808382 6.731 0 61 -10424.56 20975.11 21250.85 8.629275e+16 526 588

4. Build a Random Forest model of subscribers

library(randomForest)
library(caTools)
set.seed(42) # for reproducibility
split <- sample.split(youtube_df$subscribers, SplitRatio = 0.8)
train_data <- subset(youtube_df, split == TRUE)
test_data <- subset(youtube_df, split == FALSE)

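# Train a Random Forest model
# Note: the model is fitted on the full youtube_df (not train_data), so test_data is not truly held out
rf_modelS <- randomForest(subscribers ~ ., data=youtube_df, importance=TRUE)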

# Make predictions on the test set
predictions <- predict(rf_modelS, test_data)

# Calculate R^2 score
r2_score <- cor(test_data$subscribers, predictions)^2

# Output the model and R^2 score
print(rf_modelS)
## 
## Call:
##  randomForest(formula = subscribers ~ ., data = youtube_df, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##           Mean of squared residuals: 5.343385e+13
##                     % Var explained: 85.25

We used the same predictors (country, category, video_per_upload, uploads, video_views) to build three linear models with responses subscribers, earning_differences, and highest_yearly_earnings. The respective \(r^2\) values are \(0.754, 0.438, 0.438\), so with these covariates the linear model explains subscribers considerably better than either earnings measure. For comparison with the multiple linear regression model, we also trained a Random Forest model to predict the number of subscribers, splitting the data into training and test sets with a proportion of 0.8. The test-set \(r^2\) of 0.977 is well above the 0.754 obtained from the linear regression model. Two caveats apply to this comparison. First, the forest was fitted with subscribers ~ ., so it uses every variable in the dataset rather than only the five predictors above. Second, it was fitted on the full youtube_df, which includes the test rows, so 0.977 is an optimistic estimate. The out-of-bag figure printed by the model (85.25% of variance explained) is the more conservative measure, and it still exceeds the linear model. Together with the regression coefficients, these results indicate that the number of video views is one of the most important factors for predicting the number of subscribers.
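Because the randomForest() call above is fitted on youtube_df rather than train_data, the test rows are not unseen by the model. A minimal sketch of a stricter holdout evaluation follows. It uses simulated data, since the Kaggle file is not reproduced here; toy_df, its columns, and the object names rf_holdout, r2_test, and r2_oob are all illustrative assumptions, and the numbers it produces say nothing about the real dataset.

```r
library(randomForest)

set.seed(42)

# Hypothetical stand-in for youtube_df (simulated, not the Kaggle data)
n <- 300
toy_df <- data.frame(
  video_views = rexp(n, rate = 1e-9),
  uploads     = rpois(n, lambda = 500),
  category    = factor(sample(c("Music", "Gaming", "Education"), n, replace = TRUE))
)
toy_df$subscribers <- 1e6 + 2e-3 * toy_df$video_views + 1e3 * toy_df$uploads +
  rnorm(n, sd = 5e5)

# 80/20 split on row indices
train_idx  <- sample(seq_len(n), size = floor(0.8 * n))
train_data <- toy_df[train_idx, ]
test_data  <- toy_df[-train_idx, ]

# Fit on the training rows only, so the test rows stay unseen
rf_holdout <- randomForest(subscribers ~ ., data = train_data, importance = TRUE)

# Squared correlation between observed and predicted values on the unseen rows
test_pred <- predict(rf_holdout, newdata = test_data)
r2_test   <- cor(test_data$subscribers, test_pred)^2

# Out-of-bag pseudo R-squared after the last tree ("% Var explained" / 100)
r2_oob <- tail(rf_holdout$rsq, 1)
```

With the forest fitted on train_data only, r2_test and r2_oob should usually be of similar size; a test score far above the out-of-bag score is a hint that test rows leaked into training.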

Discussion

We also check the linear regression assumptions (e.g., homoscedasticity, normality of residuals) and observe potential outliers that can impact the model’s accuracy. To enhance model accuracy, we may consider removing these outliers and subsequently explore missing value imputation. In summary, the Random Forest model proves highly effective for this dataset; however, further validation, potentially with different datasets or cross-validation, is advisable to ensure the model’s generalizability and assess the potential impact of overfitting.
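The cross-validation mentioned above can be sketched in base R as follows. This is an illustration on simulated data, not the analysis run for this report: toy_df, the choice of k = 5, and the use of RMSE as the error measure are assumptions made for the example.

```r
set.seed(1)

# Hypothetical stand-in for youtube_df (simulated, not the Kaggle data)
n <- 200
toy_df <- data.frame(
  video_views = rexp(n, rate = 1e-9),
  uploads     = rpois(n, lambda = 500)
)
toy_df$subscribers <- 1e6 + 2e-3 * toy_df$video_views + rnorm(n, sd = 5e5)

# Assign every row to one of k folds at random, with equal fold sizes
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, then score on the held-out fold
cv_rmse <- sapply(seq_len(k), function(fold) {
  train <- toy_df[fold_id != fold, ]
  test  <- toy_df[fold_id == fold, ]
  fit   <- lm(subscribers ~ video_views + uploads, data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((test$subscribers - pred)^2))
})

# Average out-of-fold error across the k folds
mean_cv_rmse <- mean(cv_rmse)
```

The same loop works for a Random Forest by swapping the lm() call for randomForest(); comparing the two mean errors gives a like-for-like comparison of the models.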

Interpretation for model 1

  1. Model Formula:
    • The model predicts ‘subscribers’ based on ‘country’, ‘category’, ‘video_per_upload’, ‘uploads’, and ‘video_views’.
  2. Residuals:
    • The residuals’ section provides a five-number summary (minimum, 1st quartile, median, 3rd quartile, maximum) of the model residuals. Large values for the minimum and maximum suggest the presence of outliers.
  3. Coefficients:
    • The estimates for each predictor variable (country, category, video_per_upload, uploads, video_views) are provided along with their standard errors, t-values, and p-values.
    • Most of the country and category levels are not statistically significant predictors of subscribers (p > 0.05), except for ‘countrySouth Korea’ which is significant (p = 0.0126, indicated by *).
    • The variable ‘video_per_upload’ is not a significant predictor (p = 0.4483).
    • The variable ‘uploads’ is significant (p < 0.001, indicated by ***).
    • The variable ‘video_views’ is highly significant (p < 2e-16, indicated by ***), indicating a strong association with the number of subscribers.
  4. Model Summary:
    • Residual standard error: An estimate of the standard deviation of the residuals, which gives a measure of the typical size of the residuals.
    • Multiple R-squared: The proportion of variance in the dependent variable that is predictable from the independent variables (0.7526, or 75.26%).
    • Adjusted R-squared: Adjusted for the number of predictors in the model, providing a more accurate measure of model fit (0.724, or 72.4%).
    • F-statistic: A measure of how much the model improves the fit compared to a model with no predictors. The associated p-value (p < 2e-16) suggests the model as a whole is statistically significant.
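The link between the two R-squared figures can be checked by hand. The sketch below applies the standard adjustment, 1 - (1 - R²)(n - 1) / df.residual. It assumes that this model shares the 588 observations and 526 residual degrees of freedom shown in the glance() output of the other two models, which use the same predictors; adjusted_r2 is a helper written for this example.

```r
# Adjusted R-squared from R-squared, sample size, and residual degrees of freedom
adjusted_r2 <- function(r2, n, df_residual) {
  1 - (1 - r2) * (n - 1) / df_residual
}

# Subscribers model: R-squared 0.7526 (n and df.residual are assumed, see above)
adj_subscribers <- adjusted_r2(0.7526, n = 588, df_residual = 526)

# Earnings models: R-squared 0.4386 with n = 588 and df.residual = 526
adj_earnings <- adjusted_r2(0.4386, n = 588, df_residual = 526)

round(c(subscribers = adj_subscribers, earnings = adj_earnings), 3)
```

The penalty factor (n - 1) / df.residual is about 1.116 here because the two factors expand into 58 dummy coefficients, which is why the adjusted values sit noticeably below the raw ones.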

Interpretation:

The model explains a large share of the variance in the number of subscribers (as indicated by the R-squared values). ‘video_views’ is a particularly strong predictor. However, most of the individual country and category levels do not contribute significantly to the model. This suggests that, once overall video views are accounted for, the creator’s country and the content category may matter less, with the exception of South Korea.
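The p-value of an individual level only compares that level with the baseline level. Whether a factor such as ‘country’ contributes as a whole is a separate question, which a joint F-test on nested models can address. The following is a hedged sketch on simulated data; toy_df and its four made-up country labels are assumptions, and the simulated response deliberately does not depend on country.

```r
set.seed(7)

# Hypothetical stand-in for youtube_df (simulated, not the Kaggle data)
n <- 240
toy_df <- data.frame(
  country     = factor(sample(c("A", "B", "C", "D"), n, replace = TRUE)),
  video_views = rexp(n, rate = 1e-9)
)
toy_df$subscribers <- 1e6 + 2e-3 * toy_df$video_views + rnorm(n, sd = 5e5)

# Full model with the factor, reduced model without it
full    <- lm(subscribers ~ country + video_views, data = toy_df)
reduced <- lm(subscribers ~ video_views, data = toy_df)

# Joint F-test: do the three country dummies improve the fit together?
f_test <- anova(reduced, full)
f_test
```

The second row of f_test reports the F statistic and p-value for all country coefficients at once, which is a more reliable basis for dropping a factor than counting non-significant levels.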

Interpretation for model 2

  1. Model Formula:
    • The model predicts ‘earning_differences’ based on the independent variables ‘country’, ‘category’, ‘video_per_upload’, ‘uploads’, and ‘video_views’.
  2. Residuals:
    • The residuals’ section provides a five-number summary of the residuals from the model. The wide range of values from the minimum to the maximum indicates the presence of large residuals, suggesting that there may be outliers or that the model may not be adequately capturing the pattern in the data.
  3. Coefficients:
    • The coefficients represent the estimated change in ‘earning_differences’ for a one-unit change in the predictor, holding other variables constant.
    • Most of the country and category coefficients are not statistically significant (p > 0.05), which suggests they do not have a unique effect on the ‘earning_differences’ after accounting for other factors in the model.
    • ‘video_per_upload’ has a coefficient that is not statistically significant (p = 0.292).
    • ‘video_views’ is highly significant (p < 2e-16, indicated by ***), indicating a strong association with ‘earning_differences’.
  4. Model Summary:
    • Residual standard error: An estimate of the standard deviation of the residuals.
    • Multiple R-squared: The proportion of variance in the dependent variable that is predictable from the independent variables (0.4386, or 43.86%).
    • Adjusted R-squared: Adjusted for the number of predictors in the model, which provides a more accurate measure of model fit (0.3736, or 37.36%).
    • F-statistic: A measure of the overall significance of the model. The associated p-value (p < 2e-16) suggests the model as a whole is statistically significant.

Interpretation:

The model explains a moderate amount of the variance in ‘earning_differences’. The variable ‘video_views’ stands out as a strong predictor. The country and category variables generally do not significantly predict ‘earning_differences’, with a few exceptions. Notably, the ‘countryLatvia’ coefficient is significant (p = 0.007876), suggesting it has a unique effect on ‘earning_differences’.

Recommendations:

  • Given the large residuals and the significance of some country coefficients, further investigation is warranted, in particular into possible outliers or influential points that could be affecting the model.
  • Consider exploring different transformations of the dependent and independent variables to achieve a better fit.
  • Simplify the model by potentially removing non-significant predictors, although the significance of the overall model suggests that at least some of the predictors are useful. The low significance of many individual predictors suggests that there might be multicollinearity or other issues affecting the estimates.
  • The presence of significant predictors like ‘video_views’ may warrant a closer look at potential non-linear relationships or interactions between variables that could improve model fit.
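As one illustration of the transformation idea in the recommendations, the sketch below fits a linear model on the raw scale and on the log scale. The data are simulated with a multiplicative error, which is an assumption chosen to resemble heavy-tailed earnings and view counts; toy_df, fit_raw, and fit_log are hypothetical names, and nothing here is a result from the report.

```r
set.seed(3)

# Hypothetical heavy-tailed data (simulated, not the Kaggle data)
n <- 300
toy_df <- data.frame(video_views = exp(rnorm(n, mean = 21, sd = 1)))
toy_df$earnings <- exp(-8 + 1.0 * log(toy_df$video_views) + rnorm(n, sd = 0.6))

# Model on the raw scale, as in the report
fit_raw <- lm(earnings ~ video_views, data = toy_df)

# Model on the log scale; log1p() also handles zero values safely
fit_log <- lm(log1p(earnings) ~ log1p(video_views), data = toy_df)

# On the log scale the slope reads as an elasticity (percent change in
# earnings per percent change in views)
slope_log <- unname(coef(fit_log)[2])

# The two R-squared values describe different response scales, so residual
# plots are the better basis for choosing between the models
r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared
```

Running check_model() on both fits would show whether the log scale reduces the influence of the largest channels on the residuals.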

Interpretation for model 3

This model is a Multiple Linear Regression (MLR) model fitted in R, with ‘highest_yearly_earnings’ as the dependent variable and ‘country’, ‘category’, ‘video_per_upload’, ‘uploads’, and ‘video_views’ as independent variables. The output is interpreted as follows:

  1. Model Formula:
    • The model predicts ‘highest_yearly_earnings’ based on the independent variables mentioned.
  2. Residuals:
    • The residuals’ section provides a five-number summary of the model’s residuals. The large range suggests the presence of outliers or extreme values in the data.
  3. Coefficients:
    • The estimates for each predictor variable are given along with their standard errors, t-values, and p-values.
    • Most coefficients are not statistically significant, but there are a few exceptions:
      • ‘countryLatvia’ has a positive coefficient that is significant at the 0.01 level (p = 0.002892).
      • ‘countrySouth Korea’ has a positive coefficient that is significant at the 0.001 level (p = 0.000702).
    • ‘video_per_upload’ is not a significant predictor (p = 0.29118).
    • ‘video_views’ has a positive and highly significant coefficient (p < 2e-16), indicating a strong and statistically significant relationship with ‘highest_yearly_earnings’.
  4. Model Summary:
    • Residual standard error: An estimate of the standard deviation of the residuals.
    • Multiple R-squared: A measure of the proportion of variance in the dependent variable explained by the model (0.4387, or 43.87%).
    • Adjusted R-squared: Adjusted for the number of predictors; it provides a more accurate measure of the goodness of fit (0.3737, or 37.37%).
    • F-statistic: Reflects the overall significance of the model. A very low p-value (< 2.2e-16) indicates that the model is statistically significant.

Interpretation:

The model has moderate explanatory power for ‘highest_yearly_earnings’, with ‘video_views’ being a particularly strong predictor. While most country and category levels are not significant on their own, the overall model is significant, suggesting that a combination of these variables helps predict the highest yearly earnings. The significant coefficients for ‘countryLatvia’ and ‘countrySouth Korea’ suggest that creators in these countries differ significantly in ‘highest_yearly_earnings’ from creators in the baseline country, holding the other predictors fixed. The baseline is the reference level of ‘country’, which R absorbs into the intercept and therefore does not list in the coefficient table.
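The baseline country can be read from the factor itself, and changed if a different comparison is more meaningful. The sketch below uses a tiny made-up data frame (toy_df and its values are assumptions for illustration); on the real data the same calls would be applied to youtube_df$country.

```r
# Tiny hypothetical example (not the Kaggle data)
toy_df <- data.frame(
  country  = factor(c("India", "Brazil", "United States",
                      "Brazil", "India", "United States")),
  earnings = c(5, 3, 9, 4, 6, 10)
)

# The first factor level is the baseline that lm() folds into the intercept
baseline_before <- levels(toy_df$country)[1]

fit_default  <- lm(earnings ~ country, data = toy_df)
terms_before <- names(coef(fit_default))

# Choose a different baseline, then refit
toy_df$country <- relevel(toy_df$country, ref = "United States")
fit_releveled  <- lm(earnings ~ country, data = toy_df)
terms_after    <- names(coef(fit_releveled))
```

By default R orders factor levels alphabetically, so the baseline is the alphabetically first country unless relevel() is used. After releveling, every country coefficient is a difference from the United States instead.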